Journal of Bioinformatics and Systems Biology

Fortune Journals

Preprints posted in the last 30 days, ranked by how well they match Journal of Bioinformatics and Systems Biology's content profile, based on 14 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.

1
Combining amino acid frequency and 1D convolutional neural network embeddings for the identification of protein-protein interactions using a random forest classifier

Sindhi, N. A.; Pawar, N.; Dixson, J.; Garcia, D.

2026-05-18 bioinformatics 10.64898/2026.05.15.725340 medRxiv
Top 0.1%
2.1%

Predicting protein-protein interactions is a fundamental problem in molecular biology. Experimental approaches for identifying protein-protein interactions are time-consuming and labor-intensive, motivating the development of efficient computational alternatives, including machine learning-based methods. However, conventional machine learning methods often rely on manually engineered features that require substantial domain expertise. In this study, we propose a two-stage framework to address these limitations. In the first stage, a one-dimensional convolutional neural network autoencoder is used to automatically learn latent representations from protein sequences. The quality of these features is evaluated through reconstruction error, reflecting how accurately the model reconstructs the original sequence. In the second stage, these learned features are combined with amino acid frequency-based features to form a hybrid feature set for predicting protein-protein interactions. A systematic comparison is performed between models trained on frequency features alone and those using a hybrid representation. The comparison showed that incorporating one-dimensional convolutional neural network-derived latent features improved the models' performance in predicting protein-protein interactions. The dataset was split into training, validation, and test sets. Nested cross-validation was employed, with inner loops for hyperparameter tuning and outer loops for model selection. The random forest classifier achieved the best performance, with a mean receiver operating characteristic-area under curve of 0.91 and a test F1-score of 0.87. These results highlight the effectiveness of integrating deep feature learning with ensemble methods for predicting protein-protein interactions and build upon previous work focused on this fundamental problem.

Author Summary: Protein-protein interactions are fundamental in all biological processes. However, predicting these interactions is a key problem in molecular biology. Computational approaches have been tested to address this problem. We applied a mix of machine learning and deep learning to gain insight into the qualities of proteins that engage in interaction. First, we trained a deep learning model, which automatically learned the primary sequence and its related characteristics, reducing bias in the actual prediction process. We combined these features, or latent representations, with amino acid frequency features of protein sequences, and called the two together "hybrid features." Then we performed a systematic comparison of amino acid frequency features alone against hybrid features across four different machine learning classifiers. Our results suggest that the random forest classifier performed best among all four classifiers at predicting interactions between proteins. We propose that this approach could be used to improve efficiency in testing protein-protein interactions at the bench and may have applications to other biologically relevant molecular interactions.
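
As a rough illustration of the hybrid feature set described in this abstract, the sketch below computes amino acid frequencies for two sequences and concatenates them with latent vectors. The function names, the ordering of the 20-letter alphabet, and the placeholder latent vectors are assumptions made for illustration; the preprint's actual autoencoder and feature layout are not given in this listing.

```python
from collections import Counter

# Standard 20 amino acids; the ordering is an arbitrary choice for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(sequence):
    """Return the relative frequency of each standard amino acid in a sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def hybrid_pair_features(seq_a, seq_b, latent_a, latent_b):
    """Concatenate frequency features with (hypothetical) autoencoder latents for a protein pair."""
    return (
        aa_frequencies(seq_a)
        + aa_frequencies(seq_b)
        + list(latent_a)
        + list(latent_b)
    )


# Toy sequences and made-up two-dimensional latent vectors, for illustration only.
features = hybrid_pair_features("MKTAYIAK", "GAVLIM", [0.1, -0.3], [0.7, 0.2])
print(len(features))  # 20 + 20 + 2 + 2 = 44
```

A vector of this kind could then be passed to any standard classifier; the choice of a random forest in the abstract is the authors', not something this sketch demonstrates.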

2
Mutational and bioinformatic analysis of the binding site for the ribonucleotide reductase-specific transcriptional repressor NrdR

Shahid, S.; Lundin, D.; Rozman Grinberg, I.; Sjöberg, B.-M.

2026-05-14 molecular biology 10.64898/2026.05.11.724285 medRxiv
Top 0.1%
1.7%

The prevalent transcriptional repressor NrdR binds to highly conserved prokaryotic sequences in the promoter regions of operons encoding the essential enzyme ribonucleotide reductase. The NrdR binding sites consist of two partially palindromic 16 bp sequences (NrdR boxes) separated by a 15-16 bp linker sequence. We have assessed the requirement of both boxes for binding and the propensity of different NrdRs to bind to heterologous binding sites, and shown that the linker sequence is constrained only in length and not in sequence conservation. As we have observed several deviations from the conserved sequences of the NrdR boxes, we here test the conservation requirements of individual base pairs in the NrdR boxes using a synthetic DNA fragment (Synt DNA) to which the NrdR proteins from the actinomycete Streptomyces coelicolor and the gammaproteobacterium Escherichia coli bind as well as to their homologous binding sites. By introducing isolated mutations into Synt DNA and testing the binding capacity of NrdR from S. coelicolor and E. coli, we expand our understanding of the criteria needed to build a functional binding site for the NrdR repressor.

3
Design of DNA Aptamers for Lyme disease Diagnosis Combining experimental and numerical approaches

GAYRAUD, G.; Davila Felipe, M.; Padiolleau-Lefevre, S.; Maffucci, I.; Issouani, E. M.; Guerin, M.; Da Ponte, H.

2026-05-15 bioinformatics 10.64898/2026.05.13.724892 medRxiv
Top 0.2%
1.5%

Aptamers are single stranded DNA or RNA molecules selected for their high affinity and specificity to bind target molecules, similar to antibodies. They are commonly selected through the SELEX process, which involves the iterative exposure of a random sequence library to a target and retaining the sequences showing good binding properties. To improve Lyme disease detection, we propose designing aptamers that specifically bind to the CspZ protein on the surface of Borrelia burgdorferi, the bacterium responsible for the disease. Starting with a SELEX process consisting of thirteen rounds, from which selected in vitro sequence candidates have emerged, we aim to propose a holistic process that selects in silico new sequence candidates that are further validated experimentally. Our approach relies on 1) using Machine Learning (ML) techniques, specifically a Restricted Boltzmann Machine (RBM), to digitally replicate the last round of the SELEX process, 2) integrating insights from text analysis methods, such as word2vec and n-grams, into the RBM model trained on the final-round SELEX dataset to represent and compare newly generated sequences with in vitro candidates, 3) selecting in silico sequences with strong potential to bind to CspZ protein, 4) experimentally validating the selected in silico sequences of step 3. Our holistic approach combines biological insights with statistical models to improve the efficiency and outcome of the SELEX process. We enhance the RBM model, designed to replicate the distribution of the final SELEX round, by integrating geometric representations of sequences, which is especially advantageous when dealing with limited datasets relative to the vast sequence space. In addition, it provides in silico sequence candidates with strong binding properties.

4
Genome-wide computational prediction of miRNAs encoded by influenza A virus (H3N2) predicts target genes involved in pulmonary and antiviral innate immunity

Siddiqi, M. A.; Kumar, H.; Mazumder, M.

2026-05-18 bioinformatics 10.64898/2026.05.18.725090 medRxiv
Top 0.3%
1.1%

Influenza A virus (IAV) causes significant morbidity and mortality worldwide. Understanding how viral RNAs may regulate host genes through microRNA-like mechanisms can clarify pathogenesis and reveal therapeutic targets. In this study, we screened all eight IAV H3N2 RNA segments (PB2, PB1, PA, HA, NP, NA, M, and NS) using an ab initio computational pipeline; five segments (PB2, PB1, PA, HA, and M) met the VMir scoring threshold for further analysis, while NP, NA, and NS were excluded due to low pre-miRNA scores. Mature miRNAs were identified using MatureBayes, and target genes in the human genome were predicted with the miRDB server. From these targets, we selected two genes per qualifying segment (10 genes total) based on their functional relevance to influenza infection and supporting literature; all selected genes are unique to their respective segment. We identified 10 segment-specific target genes (IFNL1, DDX60, SAMHD1, MAVS, IRF4, BIRC2, AGO1, MAP3K1, NOD1, and TNFAIP1) and one common target across all five analyzed segments (CADM2). Gene Ontology and pathway analyses showed enrichment in interferon signaling, RIG-I-like receptor pathways, antiviral restriction, RNA interference, and inflammatory responses. Literature supports roles for these genes in pulmonary and antiviral innate immunity. Our findings provide a basis for experimental validation and may help the research community better understand influenza virus pathogenesis and identify novel therapeutic candidates.

[Graphical abstract: Figure 1]

5
Simple Electroporation of Chlamydomonas reinhardtii Strains with an Intact Cell Wall

Messmer, M.; de Carpentier, F.; Lam, E.; Hong, M.; Wakao, S.; Schroda, M.; Niyogi, K. K.

2026-05-05 molecular biology 10.64898/2026.04.30.721989 medRxiv
Top 0.3%
1.1%

Chlamydomonas reinhardtii is a model green alga extensively used to study photosynthesis and cilia using molecular biology and genetics. Electroporation is a very common technique to transform DNA into the nuclear genome, which is essential to generate mutant collections and express transgenes. Here, we describe a simple, fast, and efficient protocol to transform strains with an intact cell wall. It achieves a good transformation efficiency without cell wall digestion or use of commercial kits and is compatible with the widely available Gene Pulser electroporation system.

Key features: high transformation efficiency of Chlamydomonas reinhardtii strains with an intact cell wall; faster than currently available electroporation protocols.

6
MagNet: Computational Methods for Constructing High-Confidence Protein-Protein Interaction Networks in Magnaporthe oryzae

Kim, H.; Cheong, K.; Jeon, J.; Choi, G.; Koh, J.; Song, H.; Hue, Y.; Nam, Y.; Choi, B.; Lim, Y.-J.; Choi, J.; Kim, K.-T.; Lee, Y.-H.

2026-05-14 genomics 10.64898/2026.05.11.724438 medRxiv
Top 0.5%
0.8%

Magnaporthe oryzae, the rice blast fungus, serves as a model organism for molecular plant-microbe interaction research. Studies on the pathogenic mechanism of this fungus have revealed many genes involved in signaling pathways. As multi-omics data become available, genome-level studies have been conducted to uncover the biological processes underlying the pathogenesis of M. oryzae. Identifying the genome-wide protein-protein interaction (PPI) network is one such omics-level approach, which helps to understand signaling and regulatory pathways. However, existing biological network resources for M. oryzae are not sufficient to decipher pathogenesis mechanisms due to the abundance of false positives and false negatives. In this study, a reliable PPI network database of M. oryzae, MagNet, was constructed with three methods: homology-based interolog search, co-expression network construction, and domain-domain interaction (DDI)-based prediction. Combining the three approaches generated a pan-network of 5,600,976 interactions, including 217,531 high-confidence interactions supported by all three methods. Experimental data on M. oryzae PPIs indicated that our PPI network predicts PPIs with higher accuracy than previously constructed databases. MagNet provides integrated biological network data that can help to understand the molecular mechanisms of the rice blast fungus. The PPI data can be accessed via https://magnet.scnu.ac.kr.

7
Physics-Informed Neural Networks for Parameter Recovery in the Repressilator Oscillatory Model

Casajuana, B.; Casals-Franch, R.; Lopez Garcia de Lomana, A.; Marti-Puig, P.; Villa-Freixa, J.

2026-05-15 bioinformatics 10.64898/2026.05.12.724679 medRxiv
Top 0.6%
0.8%

Parameter estimation in nonlinear biological dynamical systems is a difficult inverse problem because the governing equations are often stiff or oscillatory, the data are sparse and noisy, and the objective landscape is non-convex. Physics-informed neural networks (PINNs) offer an alternative to purely simulation-based calibration by representing state trajectories with neural networks while penalizing violations of the governing equations. This paper studies the empirical reliability of PINNs for recovering the parameters of the repressilator, a synthetic genetic oscillator formed by three cyclically repressive genes. We use synthetic time-series generated from the standard ordinary differential equation model and train inverse PINNs to estimate the production parameter β and the Hill coefficient n. The study varies observation noise, partial observation of repressors, sampling density, sensitivity to initial parameter guesses, and the difference between stable and oscillatory regimes. The results show that PINNs can reconstruct trajectories accurately when the model structure is correct and the three repressors are observed, but parameter recovery is more fragile than trajectory fitting. Noise, sparse sampling, unobserved variables, and unfavorable initial guesses increase the risk of biased estimates. The stable regime is easier to reconstruct, whereas the oscillatory regime provides richer information but also exposes optimization sensitivity. These findings support PINNs as a useful reverse-engineering tool for small gene-regulatory ODE models, while highlighting the need for repeated runs, uncertainty reporting, and experimental designs that improve identifiability.
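
To make the idea of "penalizing violations of the governing equations" concrete, here is a minimal sketch of a physics residual for a repressilator. It assumes a reduced, dimensionless three-protein form, dx_i/dt = β / (1 + x_prev^n) - x_i, which may differ from the model used in the preprint, and it uses finite differences where a real PINN would differentiate the network output. All function names are hypothetical.

```python
def repressilator_rhs(x, beta, n):
    """Right-hand side of an assumed reduced repressilator.

    Each protein i is repressed by the previous one in the cycle:
    dx_i/dt = beta / (1 + x_prev**n) - x_i
    """
    # x[i - 1] wraps around for i == 0, which closes the three-gene cycle.
    return [beta / (1.0 + x[i - 1] ** n) - x[i] for i in range(3)]


def physics_residual(times, states, beta, n):
    """Mean squared mismatch between finite-difference slopes and the ODE.

    In a PINN this term would be added to the data-fitting loss, and beta and n
    would be trainable; forward differences stand in for automatic differentiation.
    """
    total, count = 0.0, 0
    for k in range(len(times) - 1):
        dt = times[k + 1] - times[k]
        rhs = repressilator_rhs(states[k], beta, n)
        for i in range(3):
            slope = (states[k + 1][i] - states[k][i]) / dt
            total += (slope - rhs[i]) ** 2
            count += 1
    return total / count


# A trajectory resting at the symmetric fixed point x = 1 (valid for beta = 2, n = 2)
# satisfies the assumed equations exactly, so its residual is zero.
times = [0.0, 0.1, 0.2]
states = [[1.0, 1.0, 1.0] for _ in times]
print(physics_residual(times, states, beta=2.0, n=2))  # 0.0
```

Evaluating the same trajectory with a wrong parameter value gives a positive residual, which is the signal an inverse PINN would use to adjust β and n.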

8
Genome-wide identification of rhabdoviral sequences in alfalfa (Medicago sativa L.)

Grinstead, S.; Nemchinov, L. G.

2026-05-22 genomics 10.64898/2026.05.20.726541 medRxiv
Top 0.7%
0.7%

We recently reported the identification of endogenous viral elements (EVEs) originating from the Caulimoviridae family within the alfalfa (Medicago sativa L.) genome. Our subsequent identification of ubiquitous rhabdoviral elements in infected and healthy alfalfa tissues by high-throughput sequencing prompted us to suggest that the alfalfa genome might be populated with integrated rhabdoviruses as well. Bioinformatics analysis using 26 publicly available alfalfa genomes proved the suggestion accurate. We found multiple non-retroviral segments of the Rhabdoviridae family belonging to the genera Betanucleorhabdovirus and Betacytorhabdovirus that appeared to be stable constituents of the host genome. In that capacity they could potentially acquire functional roles in alfalfa's development and response to environmental stresses. We believe this study reveals the first documented case of rhabdoviruses integrated into the alfalfa genome.

9
YY1 Binding Motif at Upstream of Rep/Cap Increases AAV Yield and Full Capsids

Ofusa, Y.; Nishio, S.; Enoki, T.; Mineno, J.; Ozawa, K.; Mizukami, H.; Ohba, K.

2026-05-22 microbiology 10.64898/2026.05.21.726733 medRxiv
Top 0.7%
0.7%

Adeno-associated virus (AAV) vectors are widely used in gene therapy, but low manufacturing efficiency and a large proportion of empty capsids remain major obstacles. This study focused on the Yin Yang 1 (YY1) binding motif (YY1-motif) and investigated the effect of its presence or insertion upstream of the Replicase (Rep)/Capsid (Cap) gene on AAV vector production. We found that a YY1-motif incidentally present in a Rep/Cap plasmid was associated with high vector production. We then designed several modified Rep/Cap (RC2) constructs. Insertion of the YY1-motif upstream of the Rep/Cap gene increased vector yield in a repeat-number-dependent manner, and similar effects were not observed with the insertion of other promoters. Furthermore, insertion of the YY1-motif reduced the amount of Cap protein per equal amount of full particles in supernatants across multiple serotypes, indicating an improvement in the empty/full capsid ratio. The YY1-motif insertion did not affect AAV vector infectivity. These results indicate that the YY1-motif has a universal regulatory function that optimizes the Rep/Cap expression balance and simultaneously improves the production efficiency and full particle formation of AAV vectors. This finding could contribute to the development of highly efficient and high-quality AAV manufacturing processes.

10
Automatic Bevacizumab Response Prediction in Ovarian Cancer from Digital Pathology Images via Novel AI-based Computational Pipeline

Alsaiari, A.; Turki, T.; Taguchi, Y.-h.

2026-05-04 bioinformatics 10.64898/2026.04.29.721782 medRxiv
Top 0.8%
0.7%

Ovarian cancer is a gynecological cancer that, if it metastasizes and is not detected early, can be fatal. There is therefore a need to accurately predict drug response in ovarian cancer. A gynecological pathologist inspects tissue for abnormalities and then provides a report on the patient; however, this diagnostic process (1) is hard, (2) requires experience, and (3) is time-consuming. Moreover, existing tools are far from perfect. Hence, we present a computational pipeline to improve the prediction of drug response in ovarian cancer, derived as follows. First, we downloaded digital pathology images pertaining to ovarian bevacizumab response from The Cancer Imaging Archive repository. We applied histogram of oriented gradients to the images to construct feature vectors, which were provided to Fisher linear discriminant analysis to change the representation through dimensionality reduction. Then, we provided the reduced-dimensionality data to regression analysis through support vector regression coupled with various kernels and calculated the area under the ROC curve (AUC). Experimental results against transformer-based models (ViT and Swin) and other deep learning (DL) models (VGG16, ResNet50, InceptionV3, MobileNetV2, and EfficientNetB6) demonstrate that our approach with a radial kernel (named SVRD+R) yielded an AUC performance improvement of 17% against the best-performing transformer-based model (ViT) and an AUC performance improvement of 14.9% against the best DL-based model (MobileNetV2). These results demonstrate the superiority and feasibility of our AI-based pipeline when tackling prediction problems pertaining to gynecologic cancer studies. MSC: 92B05; 68T09

11
Spurious correlation inflates performance in single-cell perturbation prediction

Nicol, P. B.; Shivakumar, S.; Irizarry, R.

2026-05-12 bioinformatics 10.64898/2026.05.07.723486 medRxiv
Top 0.8%
0.7%

The increasing number of computational methods designed to predict the effects of genetic perturbations on cellular gene expression profiles has led to a need for rigorous evaluation metrics. Recent benchmarking studies rely on correlation or cosine similarity of differential expression relative to a shared population of control cells. We show that these metrics are systematically inflated by statistical bias induced by reusing the same control population to define both quantities being compared. As a result, even non-informative methods can appear to perform well, particularly in datasets with limited numbers of control cells. Reanalysis of published datasets using a simple control-splitting procedure that removes this bias leads to a substantial reduction in performance previously attributed to biological signal.
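
One way to picture the control-splitting idea is sketched below: the control cells are divided into two disjoint halves, and the observed and predicted differential expression are each computed against a different half, so that noise in a shared control mean cannot appear on both sides of the comparison. This is an assumed reading of the procedure named in the abstract, not the authors' implementation; the function names and the toy data layout (one list of gene values per cell) are hypothetical.

```python
import random
from statistics import mean


def mean_profile(cells):
    """Average expression per gene across cells (each cell is a list of gene values)."""
    return [mean(values) for values in zip(*cells)]


def split_controls(control_cells, seed=0):
    """Randomly split control cells into two disjoint halves."""
    shuffled = list(control_cells)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def split_deltas(perturbed_cells, predicted_profile, control_cells, seed=0):
    """Compute observed and predicted deltas against different control halves."""
    controls_a, controls_b = split_controls(control_cells, seed)
    reference_a = mean_profile(controls_a)
    reference_b = mean_profile(controls_b)
    observed = mean_profile(perturbed_cells)
    observed_delta = [o - r for o, r in zip(observed, reference_a)]
    predicted_delta = [p - r for p, r in zip(predicted_profile, reference_b)]
    return observed_delta, predicted_delta


# Toy data: four control cells and two perturbed cells, two genes each.
controls = [[0.0, 1.0], [2.0, 1.0], [4.0, 3.0], [6.0, 3.0]]
perturbed = [[5.0, 2.0], [7.0, 4.0]]
observed_delta, predicted_delta = split_deltas(perturbed, [6.0, 3.0], controls)
```

Correlation or cosine similarity would then be computed between `observed_delta` and `predicted_delta` as usual.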

12
Efficient Stochastic Trace Generation for Transcription

Ferdowsi, A.; Fuegger, M.; Nowak, T.

2026-05-08 bioinformatics 10.64898/2026.05.05.722871 medRxiv
Top 0.9%
0.6%

Bursty transcription in single cells typically produces over-dispersed, skewed, and sometimes heavy-tailed expression distributions that are explained by two-state Markov models of the promoters. While the gold standard for simulation is exact stochastic sampling with Gillespie's algorithm, obtaining thousands of timed traces is computationally costly. Surrogate models based on stochastic differential equations (SDEs) are widely used to speed up this simulation process. An example is the Chemical Langevin Equation based on Gaussian noise, which, however, does not capture heavy-tailed noise. In this work, we present a unified SDE framework that combines deterministic drift, Gaussian fluctuations, and additive sporadic jumps of arbitrary distributions, and provide an open-source Python implementation, bcrnnoise. The framework subsumes standard surrogate models and allows for vectorized generation of batches of transcription traces. We assess computational speed and accuracy of common surrogate models along with new models, showing that high accuracy can be obtained while reducing computational cost by up to two orders of magnitude.
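
For readers unfamiliar with the exact baseline mentioned in this abstract, the following is a minimal Gillespie-style simulation of a two-state (telegraph) promoter. It does not use or imitate the bcrnnoise package; the function name, rate-parameter names, and the example values are assumptions chosen for illustration.

```python
import random


def simulate_telegraph(k_on, k_off, k_tx, k_deg, t_end, seed=0):
    """Exact stochastic simulation of a two-state promoter producing mRNA.

    Returns a list of (time, mRNA count) pairs, one per reaction event.
    """
    rng = random.Random(seed)
    t, active, mrna = 0.0, False, 0
    trace = [(t, mrna)]
    while True:
        rates = [
            0.0 if active else k_on,   # promoter OFF -> ON
            k_off if active else 0.0,  # promoter ON -> OFF
            k_tx if active else 0.0,   # transcription (only while ON)
            k_deg * mrna,              # first-order mRNA degradation
        ]
        total = sum(rates)
        if total == 0.0:
            break  # no reaction can fire
        t += rng.expovariate(total)  # waiting time to the next event
        if t > t_end:
            break
        # Pick a reaction with probability proportional to its rate.
        threshold = rng.random() * total
        cumulative = 0.0
        for reaction, rate in enumerate(rates):
            cumulative += rate
            if threshold < cumulative:
                break
        if reaction == 0:
            active = True
        elif reaction == 1:
            active = False
        elif reaction == 2:
            mrna += 1
        else:
            mrna -= 1
        trace.append((t, mrna))
    return trace


# Slow promoter switching relative to transcription gives bursty traces.
trace = simulate_telegraph(k_on=0.5, k_off=0.5, k_tx=10.0, k_deg=1.0, t_end=20.0, seed=1)
```

Each event requires its own random draws, which is why generating thousands of long traces this way becomes expensive and why SDE surrogates are attractive.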

13
In silico restriction site analysis of whole genome sequences shows patterns caused by selection and sequence duplications

Vedder, L.; Schoof, H.

2026-05-16 genomics 10.64898/2026.05.15.725336 medRxiv
Top 0.9%
0.6%

Biological sequences are known to be non-random. Thus, the comparison of in silico restriction fragment distributions of random and biological sequences may be an indicator of this non-randomness. Our analyses show that for most of the tested combinations of restriction enzyme and genome sequence, the fragments per megabase of the biological sequence deviate by more than 10% from those of the corresponding random sequence. This deviation goes in both directions, i.e., clearly increased values are as common as clearly decreased values. Although there is no species- or restriction-enzyme-specific effect, a clear impact of the GC content of both the restriction site and the genome sequence can be seen. In contrast to the random sequences, the genome sequences show distinct peaks in their fragment length distributions, hinting at repetitive elements such as transposons.
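
The kind of comparison described here can be sketched in a few lines: digest a sequence in silico, count fragments per megabase, and do the same for a random sequence with matching GC content. The sketch assumes a linear molecule, a palindromic recognition site searched on one strand only, and a simple independent-base model for the random sequence; none of these details are taken from the preprint, and the function names are hypothetical.

```python
import random


def cut_positions(sequence, site, cut_offset=0):
    """Positions where the enzyme cuts, given the offset of the cut within the site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)  # allow overlapping sites
    return positions


def fragment_lengths(sequence, site, cut_offset=0):
    """Lengths of the fragments produced by a complete digest of a linear sequence."""
    boundaries = [0] + cut_positions(sequence, site, cut_offset) + [len(sequence)]
    return [end - start for start, end in zip(boundaries, boundaries[1:]) if end > start]


def fragments_per_megabase(sequence, site, cut_offset=0):
    """Number of fragments normalized to one million bases."""
    return len(fragment_lengths(sequence, site, cut_offset)) / len(sequence) * 1_000_000


def random_sequence(length, gc_content, seed=0):
    """Random DNA with the requested expected GC fraction (independent bases)."""
    rng = random.Random(seed)
    return "".join(
        rng.choice("GC") if rng.random() < gc_content else rng.choice("AT")
        for _ in range(length)
    )


# EcoRI recognizes GAATTC and cuts after the first G (offset 1).
toy = "AAGAATTCTTGAATTCAA"
print(fragment_lengths(toy, "GAATTC", cut_offset=1))  # [3, 8, 7]
```

Comparing `fragments_per_megabase` for a genome and for `random_sequence` of the same length and GC content gives the kind of deviation the abstract reports.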

14
Denoised MDS-UPDRS Part-III Scores Yield New Patterns of Progression Heterogeneity in Early Stage Parkinson's Disease

Koss, J.; Tinaz, S.; Tagare, H.

2026-05-08 bioinformatics 10.64898/2026.05.04.722810 medRxiv
Top 0.9%
0.6%

Parkinson's Disease (PD) Motor Scores (MDS-UPDRS Part III) are quite noisy. This paper proposes a new methodology for processing these scores by first denoising the scores to enhance the underlying progression signal, and then conducting a high-dimensional analysis which does not sum the scores into a total movement score. The analysis gives novel insights into PD progression heterogeneity: it reveals that the heterogeneity is continuously variable rather than clustered into "subtypes" and that the variability is along two easily understood axes. This analysis also resolves some of the discrepancies in previously reported progression subtypes. Finally, the analysis reveals that patient-specific progression cannot be predicted from baseline using only MDS-UPDRS Part III scores.

15
Integrating spatial and single-cell multi-omics analysis of induced pluripotent stem cell-derived cervical adenocarcinoma model

Kamata, S.; Taguchi, A.; Iuchi, H.; Ikeda, Y.; Maruyama, R.; Nakanishi, Y.; Sugi, T.; Okuma, Y.; Kobayashi, O.; Tomita, N.; Yoshimoto, D.; Wang, L.; Moritsugu, N.; Takahashi, C.; Tagami, M.; Matsunaga, H.; Okayama, T.; Manabe, R.-i.; Kiyotani, K.; Ikeo, K.; Okazaki, Y.; Kiyono, T.; Masuda, S.; Hamada, M.; Takeyama, H.; Kawana, K.

2026-05-06 cancer biology 10.64898/2026.05.01.722143 medRxiv
Top 0.9%
0.6%

Human papillomavirus 18 (HPV18) preferentially infects cervical stem cell-like cells and is strongly associated with adenocarcinoma. However, the mechanisms underlying differentiation into cervical adenocarcinoma remain unclear due to the lack of appropriate experimental models. We aimed to establish a model of HPV18-associated cervical adenocarcinoma and elucidate its molecular and cellular differentiation mechanisms. HPV18 E6/E7 were introduced into induced pluripotent stem cell-derived reserve cell-like cells (iRCs) to generate tumor models. Spatial transcriptomics and single-cell multi-omics analyses were performed to integrate histological and molecular data. A distinct component (Gland_A) exhibited morphological and immunohistochemical features of cervical adenocarcinoma and was efficiently induced in iRC-18 tumors. Gland_A showed increased chromatin accessibility and elevated expression of FOXA1, FOXA2, and ALDH1A1. Analysis of clinical samples confirmed enrichment of ALDH1A1 in HPV-associated adenocarcinomas. This model recapitulates key features of HPV18-associated cervical adenocarcinoma and provides insights into its differentiation mechanisms.

16
Viral non-coding RNA structure annotation and API-based data retrieval with Rfam and R2DT

Muston, P.; Triebel, S.; Nawrocki, E.; Ontiveros-Palacios, N.; Jandalala, I.; Sweeney, B.; Bateman, A.; Marz, M.; Petrov, A. I.; Madrigal, P.

2026-05-14 bioinformatics 10.64898/2026.05.10.724034 medRxiv
Top 1%
0.5%

Rfam is a comprehensive database of non-coding RNA (ncRNA) families providing curated sequence alignments, consensus secondary structures, and covariance models for thousands of RNA families. The database is essential for identifying structured non-coding RNAs in newly sequenced genomes and understanding RNA structure-function relationships. Here we present computational protocols for automated ncRNA annotation of viral genomes, and for programmatic interaction with Rfam through its RESTful API. We showcase genome-wide RNA structure visualization from a genome sequence and from a multiple sequence alignment by generating comprehensive 2D structure diagrams using newly developed features in R2DT. We also present practical examples for retrieving family metadata, downloading alignments, accessing secondary structures, and searching user sequences from the Rfam API. These methods enable researchers in virology and RNA biology to integrate Rfam data into custom bioinformatics pipelines, comparative analyses, and machine learning workflows.

17
SPIFEE - A pipeline for analyzing traces of live-cell fluorescence microscopy data

Hogendorn, C.; R. Aragon, I.; Dallon, S.; Batchelor, E.

2026-05-11 bioinformatics 10.64898/2026.05.06.723263 medRxiv
Top 1%
0.5%

To properly respond to their environment, cells adjust the activity of key regulatory proteins and rates of gene expression. Methods to detect and quantify these forms of regulatory dynamics in living cells are of central importance for understanding cellular signaling events in both physiological and pathological conditions. Current technologies in this field make use of fluorescent probes to track cell signaling dynamics. Although these technologies have been used for decades, challenges remain. In particular, the segmentation, tracking, and interpretation of single cell dynamic data are time-consuming, prone to subjective errors, and often lacking in standardization across experiments. Here, we present SPIFEE, a data pipeline that uses experiment-dependent parameters to smooth noise and quantify key features of fluorescence data from time-lapse imaging studies. Processing data in this manner enhances and accelerates quantification of live-cell gene and protein expression, simplifies data analysis, and facilitates hypothesis generation.

Author Summary: Cells adjust protein activity and gene expression levels over time to respond to changes in their environment, a process referred to as cell signaling dynamics. Quantifying cell signaling dynamics in living cells often uses fluorescent probes, such as green fluorescent protein (GFP) and its spectral variants, to track changes in gene expression or protein activity over time. Challenges inherent in analyzing fluorescence data from single cells stem from biological and experimental noise, time-consuming quantification, and subjective errors. To address these challenges, we developed a computational tool called Signal Processing and Integrated Feature Extraction (SPIFEE). The pipeline improves the quality of fluorescence data analysis by reducing noise and extracting signal features in a way that is both intuitive and objective. The pipeline provides more accurate, rapid, and unbiased quantification of time-lapse microscopy data.

18
Establishment of titration-based control of DNA replication in Escherichia coli

Adiego-Perez, B.; Fluit, D.; Ludwig, C.; Berger, M.; Hohlbein, J.; Staals, R. H.; ten Wolde, P. R.; van der Oost, J.; Claassens, N. J.; Olivi, L.

2026-05-09 molecular biology 10.64898/2026.05.06.723188 medRxiv
Top 1%
0.4%

Escherichia coli couples the initiation of DNA replication with cell size by modulating the activity of the replication initiator protein DnaA. The activity of DnaA is regulated by both its interconversion between an active and inactive form and its titration on binding sites on the chromosome. Whereas its interconversion has been thoroughly studied, the extent to which DnaA titration can control replication initiation is poorly understood. Here, we describe the control of E. coli DNA replication via titration by modulating the expression of an always-active DnaA variant in four growth conditions. While we obtained stable cell cycles during slow growth, faster growth associated with overlapping replication forks led to replicative instability and DNA damage. Overall, our results provide insights into the limits of titration-based systems in the control of genome replication and their potential role in the evolutionary trajectory of E. coli. Finally, this study provides design principles for a simplified, titration-only regulatory mechanism for DNA replication in synthetic cells.

19
An assessment of normalization and differential expression methods for miRNA-seq analysis using a realistic benchmark dataset

Aparicio-Puerta, E.; Baran, A. M.; Ashton, J. M.; Pritchett, E. M.; Gaca, A.; Becker, J.; Halushka, M. K.; Jun, S.-H.; McCall, M. N.

2026-05-13 bioinformatics 10.64898/2026.05.08.723923 medRxiv
Top 1%
0.4%

MicroRNAs are short noncoding RNAs that regulate gene expression and are commonly profiled by small RNA sequencing (miRNA-seq). Despite the widespread use of miRNA-seq, datasets are often analyzed with RNA-seq methods such as DESeq2 or edgeR, which do not take into account the specific characteristics of miRNA-seq data. Here, we present a benchmark study of normalization and differential expression approaches using a realistic ground-truth dataset. By mixing mouse RNA from two organs, we generated expression trends while capturing biological and technical variability. Using monotonicity across the dataset and expected fold changes from the mixture design, we assessed normalization and differential expression methods. Normalization benchmarking showed that within-sample scaling, particularly Reads Per Million (RPM), best preserved the expected monotonic trends, outperforming cross-sample methods such as TMM, rlog, and VST. These approaches sometimes recovered apparent monotonicity among abundant miRNAs, but inspection of individual profiles suggested likely over-correction. Regarding differential expression, edgeR consistently ranked among the best-performing methods across several metrics, including log2 fold-change estimation, with performance comparable to miRNA-seq-specific tools such as miRglmm and NBSR. DESeq2, edgeR-v4, and limma-based approaches tended to systematically underestimate log2 fold changes. Applying a common RPM-based normalization substantially improved the performance of cross-sample methods, highlighting the strong influence of normalization on differential expression analysis. Overall, our findings support within-sample scaling methods such as RPM for normalization, and edgeR, miRglmm, or NBSR for differential expression. The dataset has been made publicly available, providing a valuable resource for objective method comparison and future miRNA-seq software development.
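
Within-sample scaling of the RPM kind is simple enough to show directly: each count is divided by the sample's total read count and multiplied by one million, with no information borrowed from other samples. The function name and the toy miRNA labels below are placeholders, not data from the benchmark.

```python
def reads_per_million(counts):
    """Scale raw read counts for one sample to reads per million (RPM)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {name: count * 1_000_000 / total for name, count in counts.items()}


# Toy sample with two placeholder miRNAs and 1,000 reads in total.
sample = {"miR-a": 750, "miR-b": 250}
print(reads_per_million(sample))  # {'miR-a': 750000.0, 'miR-b': 250000.0}
```

Because the scaling factor depends only on the sample itself, RPM cannot over-correct toward other samples, which may be related to the behavior the abstract describes for cross-sample methods.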

20
Clustering Strategies Improve Structure-Preserving Visualization of Single-Cell RNA-seq Data with CBMAP

Alchaar, M.; Dogan, B.

2026-05-04 bioinformatics 10.64898/2026.04.30.721861 medRxiv
Top 1%
0.4%

Dimensionality reduction for visualization is a fundamental step in single-cell RNA sequencing (scRNA-seq) analysis due to the extremely high dimensionality of gene expression profiles. However, widely used nonlinear embedding techniques such as UMAP and t-SNE can introduce substantial distortions when projecting data into two-dimensional space, potentially altering global organization, local neighborhoods, and distance relationships in ways that may mislead downstream biological interpretation. In this study, we investigate the applicability of Clustering-Based Manifold Approximation and Projection (CBMAP) for the visualization of scRNA-seq data and systematically examine how clustering strategies influence the quality of the resulting embeddings. CBMAP was integrated with several clustering algorithms commonly used in single-cell analysis, including k-means, Leiden, HDBSCAN, Secuer, HGC, and FlowSOM. The resulting embeddings were evaluated using quantitative metrics that measure global, local, and distance-level structure preservation and were compared with widely used dimensionality reduction methods such as UMAP, t-SNE, and PaCMAP across multiple benchmark datasets. Our results demonstrate that the clustering stage plays a critical role in determining the structural fidelity of CBMAP embeddings. Clustering algorithms specifically designed for single-cell transcriptomic data, particularly Secuer, produced more consistent preservation of global relationships between cell populations. Across multiple datasets, CBMAP more faithfully preserved global structural organization and inter-population distance relationships than the compared methods, although local neighborhood preservation was generally weaker than in techniques optimized for local structure. Importantly, CBMAP embeddings retained biologically meaningful relationships in trajectory benchmark datasets. When combined with RNA velocity analysis, CBMAP successfully preserved cyclic progenitor states and branching differentiation trajectories, demonstrating compatibility with trajectory-aware visualization. These findings indicate that CBMAP provides a structure-faithful visualization framework for scRNA-seq data and that clustering selection plays a central role in determining embedding quality.